---
title: "Class Exercise 1"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details, see the R Markdown documentation.

When you click the **Knit** button, a document will be generated that includes both content and the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

```{r cars}
summary(cars)
```

## Including Plots

You can also embed plots, for example:

```{r pressure, echo=FALSE}
plot(pressure)
```

Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.

We will work with two separate datasets, `LakeHuron` and `Loblolly`. The dataset `LakeHuron` measures the water level of Lake Huron, one of the five Great Lakes, from 1875 to 1972. `Loblolly` measures the growth and age of different loblolly seed varieties. “Loblollies” are loblolly pines, a fast-growing species important to the commercial timber industry.

Let’s look at `LakeHuron` first:

```{r}
data(LakeHuron)
LakeHuron
```

Does `LakeHuron` look like any of the data types we have already seen? Why or why not?

`LakeHuron` is actually a time series data set with data type `ts`; try the following query:

```{r eval=FALSE}
is.ts(LakeHuron)
```

We can confirm this by looking at its class attribute:

```{r eval=FALSE}
attributes(LakeHuron)
```

Note that the attribute `tsp` tells us when the series starts, when it ends, and its frequency, that is, the number of observations per unit of time (e.g., monthly data plotted on a yearly scale has a frequency of 12); the `class` attribute identifies the data type. Both of these attributes are used by the `plot` command to create a time series plot with the proper time scale.
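
The pieces of `tsp` can also be read one at a time. The chunk below is a minimal sketch using the base R functions `tsp`, `start`, `end`, and `frequency`; the series `monthly` is a made-up example introduced here only for illustration and is not part of the exercise data.

```{r eval=FALSE}
# Read the time series parameters directly: start, end, frequency.
tsp(LakeHuron)
start(LakeHuron)
end(LakeHuron)
frequency(LakeHuron)   # observations per unit of time (1 = yearly)

# Hypothetical monthly series, made up for illustration only.
monthly <- ts(1:24, start = c(2000, 1), frequency = 12)
tsp(monthly)           # the third value is now 12
```

These are the values that `plot` reads when it lays out the time axis.
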
The `plot` command itself is quite simple, but it uses built-in rules for plotting a `ts` object:

```{r eval=FALSE}
plot(LakeHuron)
```

Notice anything unusual in the time series? Lake Huron’s outlet, the St. Clair River, has been extensively dredged over the years, creating a long-term decrease in lake level that apparently leveled off decades ago. Concerns over fluctuating water levels of Lakes Michigan, Superior and Huron continue to this day.

Let’s try plotting the data from every fifth year. Notice how easily multiple commands can be nested in R.

```{r eval=FALSE}
# Keep every fifth observation, then plot the resulting plain numeric vector
plot(LakeHuron[seq(1,100,by=5)])
```

This plot appears quite different from our earlier time series plot. What has changed?

The following command converts the 5-year subset back into a time series object so that it can be plotted more properly. Are any issues still unaddressed? How would you resolve them?

```{r eval=FALSE}
# frequency=0.2 means one observation every 5 years
plot(ts(LakeHuron[seq(1,100,by=5)],start=1875,frequency=0.2))
```

Next we will work with the `Loblolly` data set.

```{r eval=FALSE}
data(Loblolly)
Loblolly
```

What kind of data set is this? Is it a matrix or a data frame?

```{r eval=FALSE}
is.matrix(Loblolly)
is.data.frame(Loblolly)
```

Since it is not a matrix, you might anticipate that commands commonly used with matrices would not work. Try these:

```{r eval=FALSE}
dim(Loblolly)
Loblolly[1:5,]
```

Did they work? Clearly, some matrix commands can be applied to data frames.

Next we confirm that `Loblolly$Seed` is a factor. Here we first type the variable name by itself; does the way in which R prints the variable provide clues to the data type?

```{r eval=FALSE}
Loblolly$Seed
is.factor(Loblolly$Seed)
```

Now enter

```{r eval=FALSE}
names(Loblolly)
```

These names are not particularly descriptive; we can change them (not in the `datasets` package, but in our local workspace) if we’d like, then construct a scatterplot for two of the variables. Are the resulting names more satisfactory? What might be a disadvantage?

```{r eval=FALSE}
# Rename the columns of the local copy of Loblolly
names(Loblolly) <- c("Height (Ft)","Age (Yrs)","Seed Variety")
names(Loblolly)
Loblolly
```

**Tidyverse code**

In our class exercises this semester, we will include tidyverse code for students who would like to explore **R** further. We will highlight code in the exercises that would be constructed differently if using the tidyverse packages `dplyr` or `ggplot2`. As examples, we will replot the original `LakeHuron` time series and rename the variables in `Loblolly`.

The plot is actually a poor introduction to `ggplot2`, since it relies on an automated function, `autoplot`, rather than the workhorse function, `ggplot`. In general, the tidyverse is set up to work with data frames and tibbles (the tidyverse version of a data frame), rather than a specialized object such as a time series.

```{r eval=FALSE}
library(dplyr)
library(ggplot2)
library(ggfortify)  # provides autoplot() support for ts objects
autoplot(LakeHuron,ts.colour="red",ylab="Water Level",xlab="Year")
```

We only manipulated a couple of defaults here; what do you think of the graph appearance as compared to the graph produced by the `plot` command?

To subset data, we need to use features in `dplyr` (pronounced dee-plier). Not only are function names different in `dplyr`, but rather than nesting function calls inside parentheses, `dplyr` encourages the use of a *pipe*, written as the symbol sequence `%>%`. In the first example, we look at the first five rows of `Loblolly` using the `slice` command. We could also use the syntax `slice(Loblolly,1:5)`, but have chosen the pipe syntax instead. In the next line, we select every other row using a hybrid of `dplyr` commands (`n()` and `slice`) and regular **R** syntax (the `seq` command). And then we finish with an alternate method for renaming columns. What is your initial impression of the pipe operator?

```{r eval=FALSE}
data(Loblolly)  # reload the original column names if they were changed above
Loblolly %>% slice(1:5)             # first five rows
Loblolly %>% slice(seq(1,n(),2))    # every other row
Loblolly %>% rename("Height (Ft)"=height,"Age (Yrs)"=age,"Seed Variety"=Seed)
```
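
To compare the two styles side by side, the chunk below is a rough sketch that writes the same row selection as a nested call and as a pipe, then uses the renamed columns in a scatterplot. The object names `lob`, `nested`, `piped`, and `lob_renamed` are introduced here only for illustration, and converting to a plain data frame with `as.data.frame` is a precaution we are assuming is reasonable for `dplyr`, not a required step.

```{r eval=FALSE}
library(dplyr)
library(ggplot2)

# Work on a plain data frame copy so the built-in dataset is left untouched.
lob <- as.data.frame(datasets::Loblolly)

# The same operation (rows 1, 3, 5, ...) as a nested call and as a pipe.
nested <- slice(lob, seq(1, nrow(lob), by = 2))
piped  <- lob %>% slice(seq(1, n(), by = 2))
identical(nested, piped)

# rename() returns a new data frame; assign the result to keep the new names.
lob_renamed <- lob %>%
  rename("Height (Ft)" = height, "Age (Yrs)" = age, "Seed Variety" = Seed)
names(lob_renamed)
names(lob)  # the copy we started from still has the original names

# Names containing spaces must be wrapped in backticks when used in code.
ggplot(lob_renamed, aes(x = `Age (Yrs)`, y = `Height (Ft)`)) +
  geom_point()
```

Note that the pipe version reads left to right in the order the steps are carried out, while the nested version is read from the inside out.
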